Abstract

A recent method for causal discovery is in many cases able to infer whether X causes
Y or Y causes X for just two observed variables X and Y. It is based on the observation
that there exist (non-Gaussian) joint distributions P(X, Y) for which Y may be written as
a function of X up to an additive noise term that is independent of X, while no such model
exists from Y to X. Whenever this is the case, one prefers the causal model X → Y.

Here we justify this method by showing that the causal hypothesis Y → X is unlikely
because it requires a specific tuning between P(Y) and P(X|Y) to generate a distribution
that admits an additive noise model from X to Y. To quantify the amount of
tuning required we derive lower bounds on the algorithmic information shared by P(Y)
and P(X|Y). This way, our justification is consistent with recent approaches for using
algorithmic information theory for causal reasoning. We extend this principle to the case
where P(X, Y) almost admits an additive noise model.

Our results suggest that the above conclusion is more reliable if the complexity of
P(Y) is high.

1 Additive noise models in causal discovery 

Causal inference from statistical data is a field of research that has attracted increasing
interest in recent years. Inferring causal relations among several random variables purely
from their joint distribution is unsolvable from the point of view of traditional statistics.
During the 1990s, however, it became more and more widely believed that non-experimental
data also contain at least hints about the causal directions. The most important postulate that
links the observed statistical dependencies on the one hand to the causal structure (which
is here assumed to be a DAG, i.e., a directed acyclic graph) on the other hand is the
causal Markov condition [13]. It states that every variable is conditionally independent
of its non-effects, given its causes. If the joint distribution $P(X_1, \dots, X_n)$ has a density
$p(x_1, \dots, x_n)$ with respect to some product measure, then the density factorizes [8] into

$$ p(x_1, \dots, x_n) = \prod_{j=1}^{n} p(x_j \mid pa_j) , $$

where $p(x_j \mid pa_j)$ denotes the conditional probability density of $X_j$, given the values $pa_j$ of
its parents $PA_j$.
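The factorization can be made concrete on a toy example. The following minimal sketch assumes a chain X1 → X2 → X3 of binary variables; the probability tables are hypothetical values chosen only for illustration. It builds the joint density as the product of conditionals and checks numerically that X3 is independent of its non-effect X1, given its cause X2.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain X1 -> X2 -> X3.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # [x1][x2]
p_x3_given_x2 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.25, 1: 0.75}}  # [x2][x3]


def joint(x1, x2, x3):
    """Markov factorization: product of each node's conditional given its parents."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]


def cond_x3_given(x1, x2):
    """p(x3 = 1 | x1, x2), computed from the joint density alone."""
    numerator = joint(x1, x2, 1)
    return numerator / (numerator + joint(x1, x2, 0))


# The factorized joint is normalized.
total = sum(joint(*values) for values in product((0, 1), repeat=3))
```

Here `cond_x3_given(0, x2)` and `cond_x3_given(1, x2)` agree for each value of x2, which is the conditional independence imposed by the causal Markov condition for this DAG.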

The Markov condition already rules out some DAGs as being incompatible with the
observed conditional dependencies. However, usually a large set of DAGs is still compatible.
In particular, for n variables, there are n! DAGs that are consistent with every joint
distribution because they do not impose any conditional independence. They are given
by defining an order $X_1, \dots, X_n$ and drawing an arrow $X_i \to X_j$ for every i < j.
For this reason, additional inference rules are required to choose the most plausible ones
among the compatible DAGs. Spirtes et al. [16] and Pearl [13] use the causal faithfulness
principle that prefers those DAGs for which the causal Markov condition imposes all the
observed independencies. In other words, it is considered unlikely that independencies are
due to particular (non-generic) choices of the conditionals $p(x_j \mid pa_j)$. The underlying idea
is, so to speak, that "nature chooses" the conditionals independently of each other,
while the generation of additional independencies (that are not imposed by the structure
of the DAG) would require mutually adjusting these conditionals. A more general
perspective on such an independence assumption has been provided by Lemeire and Dirkx
[9] who stated the following principle:

Postulate 1 (Algorithmic independence of conditionals). 

If the true causal structure is given by the directed acyclic graph G with random variables
$X_1, \dots, X_n$ as nodes, the shortest description of the joint density $p(x_1, \dots, x_n)$ is given
by separate descriptions of the conditionals¹ $p(x_j \mid pa_j)$.

In [9] the description length has been defined in terms of algorithmic information,
also called "Kolmogorov complexity" (the details will be explained in Section 2). There
the postulate is mainly used to justify the causal faithfulness assumption [16], since it
rules out mutual adjustments among conditionals like those required for unfaithful
distributions. However, in [6] it has been argued that the complete determination of the
joint distribution is never feasible, which makes it hard to give empirical content to the postulate.
Moreover, [6] shows that Lemeire and Dirkx's principle can be seen as an implication of a
general framework for causal inference via algorithmic information. There, the postulate
is rephrased in a way that avoids the complexity of conditionals and uses only empirical
observations. Furthermore, the general framework implies many causal inference rules
yet to be discovered. Here we focus on a method [5] that yielded quite encouraging results
on real data sets and show that it can also be justified via algorithmic information theory.
We briefly rephrase the idea of [5] for the special case of two real-valued variables X and
Y. To this end we introduce the following terminology:

Definition 1 (Additive noise model). 

The joint density p(x, y) of two real-valued random variables X and Y is said to admit an
additive noise model from X to Y if there is a measurable function $f : \mathbb{R} \to \mathbb{R}$ such that

$$ Y = f(X) + E , \qquad (1) $$

where E is some unobserved noise variable that is statistically independent of X. The
joint density thus is of the form

$$ p(x, y) = p_X(x)\, p_E(y - f(x)) , $$

where $p_X(x)$ is the density of X and $p_E(e)$ the density of E.

Whenever this causes no confusion, we will drop the indices and write p(x) instead of
$p_X(x)$ and, similarly, write $p(y - f(x))$. We will write $p_X$ if we want to emphasize that
we refer to the entire density and not one specific value p(x).
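For integer-valued variables with finite support, Definition 1 can be checked exactly: a model from A to B exists if and only if all conditionals p(b | a) are translates of one common noise distribution. The sketch below is an illustration under that discrete assumption (it is not the regression-based procedure of [5]); the function f(x) = x² and the two distributions are hypothetical choices.

```python
from collections import defaultdict
from fractions import Fraction


def anm_joint(p_x, p_e, f):
    """Joint pmf of (X, Y) with Y = f(X) + E and E independent of X."""
    joint = defaultdict(Fraction)
    for x, px in p_x.items():
        for e, pe in p_e.items():
            joint[(x, f(x) + e)] += px * pe
    return dict(joint)


def admits_anm(joint):
    """True iff the integer-valued pmf joint[(a, b)] admits an additive noise
    model from a to b, i.e. all conditionals p(b | a) are translates of one
    noise pmf. Translates are compared after aligning each conditional at its
    smallest support point."""
    by_a = defaultdict(dict)
    for (a, b), p in joint.items():
        if p > 0:
            by_a[a][b] = p
    shapes = set()
    for cond in by_a.values():
        total, lowest = sum(cond.values()), min(cond)
        shapes.add(frozenset((b - lowest, p / total) for b, p in cond.items()))
    return len(shapes) == 1


def swap(joint):
    """Exchange the roles of the two variables."""
    return {(b, a): p for (a, b), p in joint.items()}


p_x = {x: Fraction(1, 5) for x in range(-2, 3)}                   # hypothetical p_X
p_e = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}  # hypothetical p_E
joint = anm_joint(p_x, p_e, lambda x: x * x)
forward, backward = admits_anm(joint), admits_anm(swap(joint))
```

In this example `forward` is true and `backward` is false, so the joint distribution admits an additive noise model from X to Y only.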

¹ For the sake of simple terminology, we also consider the density $p(x_j)$ of parentless nodes as a "conditional",
given an empty set of variables.
It can be shown [5] that for generic choices of f, distribution of the noise, and distribution
of X, there is no additive noise model from Y to X. In other words, if causality in
nature were always of the form of additive noise models (which is certainly not the
case²), we could almost always identify causal directions because a joint distribution that
admits an additive noise model in the true direction usually does not admit one in the
wrong direction. This paper addresses the question whether a causal structure Y → X
that is not of the form of an additive noise model could induce a joint distribution that
admits an additive noise model in the wrong direction (i.e., from X to Y). The basic
observation of this paper is that this would be a rare coincidence because it requires that
$p_Y$ (which would be the distribution of the cause) and the transition probabilities $p_{X|Y}$
(which describe the mechanism relating cause and effect) satisfy an untypical
relation that makes this scenario unlikely. However, instead of deriving probability
values for such a coincidence (which would require assigning priors to probability distributions)
we will take a non-Bayesian view and follow the algorithmic information theory approach
developed in [6] and [9]. The following lemma makes explicit what kind of coincidence is
meant:

Lemma 1 (Relation between $p_Y$ and $p_{X|Y}$).

Let p(x, y) be positive definite and let f as well as all logarithms of marginal and
conditional densities be twice differentiable. If p(x, y) admits an additive noise model from
X to Y, then the marginal p(y) and the conditional p(x|y) are related via the differential
equation

$$ \frac{\partial^2}{\partial y^2} \log p(y) = -\frac{\partial^2}{\partial y^2} \log p(x \mid y) - \frac{1}{f'(x)} \frac{\partial^2}{\partial x\, \partial y} \log p(x \mid y) . \qquad (2) $$

Hence we have

$$ \log p(y) = -\int^{y}\!\!\int^{y'} \left[ \frac{\partial^2}{\partial y''^2} \log p(x \mid y'') + \frac{1}{f'(x)} \frac{\partial^2}{\partial x\, \partial y''} \log p(x \mid y'') \right] dy''\, dy' + a y + b , $$

where b is determined by $\int p(y)\, dy = 1$. Since the equation has to be valid for all x, we can
choose an arbitrary $x_0$ with $f'(x_0) \neq 0$. Then $p_Y$ can already be determined from $f'(x_0)$,
the function $y \mapsto p(x_0 \mid y)$ and a. Given the conditional $p_{X|Y}$, the tuple $(x_0, f'(x_0))$ and
a are sufficient to describe the marginal $p_Y$. In general, these are much fewer parameters
than those required for describing $p_Y$ without knowing $p_{X|Y}$. This already suggests that
$p_Y$ and $p_{X|Y}$ have algorithmic information in common because knowing $p_{X|Y}$ shortens
the description of $p_Y$.

However, assume we know that $p_{X,Y}$ belongs to the family of bivariate Gaussians.
Then it admits an additive noise model in both directions and both causal directions
are possible. This is consistent with the fact that our argument above fails in this case
because a and $f'(x_0)$ then coincide with the information that would also be required to
describe $p_Y$ without knowing $p_{X|Y}$. To see this, set

$$ \log p(x) \doteq -\frac{(x - \mu_X)^2}{2\sigma_X^2} , $$

where $\doteq$ denotes equality up to a term that depends neither on x nor on y. Furthermore,
let

$$ \log p(y \mid x) \doteq -\frac{(y - c x)^2}{2\sigma_E^2} $$

with the notation $c := f'(x_0)$. We then get

$$ \log p(y) \doteq -\frac{(y - \mu_Y)^2}{2\sigma_Y^2} , \qquad \mu_Y = c\,\mu_X , \quad \sigma_Y^2 = c^2 \sigma_X^2 + \sigma_E^2 . $$

² XLt discusses an interesting generalization.

Hence,

$$ \log p(x \mid y) \doteq -\frac{(x - \mu_X)^2}{2\sigma_X^2} - \frac{(y - c x)^2}{2\sigma_E^2} + \frac{(y - \mu_Y)^2}{2\sigma_Y^2} , $$

which implies

$$ \frac{\partial^2}{\partial x\, \partial y} \log p(x \mid y) = \frac{c}{\sigma_E^2} =: \alpha $$

and

$$ -\frac{\partial^2}{\partial y^2} \log p(x \mid y) = \frac{1}{\sigma_E^2} - \frac{1}{\sigma_Y^2} =: \beta . $$

The constants $\alpha$ and $\beta$ can be derived from observing p(x|y), but to determine the second
derivative of $\log p_Y$ one needs to know c since eq. (2) imposes

$$ \frac{\partial^2}{\partial y^2} \log p(y) = \beta - \frac{1}{c}\, \alpha . \qquad (3) $$

To determine $p_Y$ completely, we also need to know the first derivative

$$ \frac{d}{dy} \log p(y) \Big|_{y=0} = \frac{\mu_Y}{\sigma_Y^2} , $$

if $\mu_Y$ denotes the mean of Y. Moreover, we observe that c specifies the standard deviation
$\sigma_Y$ of Y because the left-hand side of eq. (3) is given by $-1/\sigma_Y^2$. This shows that, given
$p_{X|Y}$, we still need to describe the two parameters $\mu_Y$ and $\sigma_Y$. These are exactly the two
parameters that describe the Gaussian $p_Y$ also without knowing $p_{X|Y}$. Hence, knowing
$p_{X|Y}$ is worthless for the description of $p_Y$.
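The Gaussian case lends itself to a quick numerical sanity check of eq. (2). The sketch below assumes a linear model Y = cX + E with Gaussian X and E (the parameter names `c`, `mu_x`, `var_x`, `var_e` are illustrative and not taken from the source) and evaluates the right-hand side of eq. (2) by central finite differences of log p(x|y). Since log p(x|y) is quadratic here, the finite differences are exact up to rounding, and the result should equal −1/σ_Y² at every point (x, y).

```python
def log_p_x_given_y(x, y, c, mu_x, var_x, var_e):
    """log p(x|y), up to an additive constant, for X ~ N(mu_x, var_x) and
    Y = c*X + E with E ~ N(0, var_e) independent of X."""
    mu_y = c * mu_x
    var_y = c * c * var_x + var_e
    log_px = -(x - mu_x) ** 2 / (2 * var_x)
    log_py_given_x = -(y - c * x) ** 2 / (2 * var_e)
    log_py = -(y - mu_y) ** 2 / (2 * var_y)
    return log_px + log_py_given_x - log_py


def rhs_of_eq2(x, y, c, mu_x, var_x, var_e, h=1e-2):
    """Right-hand side of eq. (2) with f'(x) = c, via central differences."""
    def g(a, b):
        return log_p_x_given_y(a, b, c, mu_x, var_x, var_e)

    d_yy = (g(x, y + h) - 2 * g(x, y) + g(x, y - h)) / h**2
    d_xy = (g(x + h, y + h) - g(x + h, y - h)
            - g(x - h, y + h) + g(x - h, y - h)) / (4 * h**2)
    return -d_yy - d_xy / c


# Illustrative parameters: var_y = c^2 * var_x + var_e = 3.5.
params = dict(c=2.0, mu_x=1.0, var_x=0.5, var_e=1.5)
values = [rhs_of_eq2(x, y, **params) for x, y in [(0.0, 0.0), (1.3, -0.7), (-2.0, 4.0)]]
```

All entries of `values` are close to −1/3.5, the second derivative of log p(y) for these parameters, independently of the point (x, y) at which the conditional is probed.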

The intuitive arguments above show that knowing $p_{X|Y}$ makes the description of $p_Y$
shorter except for some rare cases where $p_Y$ already has a short description. Formal
statements of this kind, however, require the specification of the accuracy up to which $p_Y$
and a are described.

The paper is structured as follows. In Section 2 we briefly rephrase algorithmic
information theory based causal inference as developed in [5]. In Section 3 we show that
additive noise models from X to Y induce densities $p_Y$ and $p_{X|Y}$ that have algorithmic
information in common. In Section 4 we consider additive noise models over finite fields and
show that $p_Y$ and $p_{X|Y}$ also share algorithmic information if the distribution is only close
to an additive noise model from X to Y. Since our bounds on the information shared by
these objects depend on the Kolmogorov complexity of $p_Y$ (which cannot be determined)
we discuss a method to estimate the latter in Section 5. Section 6 and Section 7 discuss
how to apply the insights gained from the discrete case to empirical and to continuous
distributions, respectively.



2 Algorithmic information theory and the causal principle

Reichenbach's Principle of Common Cause [14] is by now a cornerstone of causal
reasoning from statistical data: Every statistical dependence between two random variables
X and Y indicates at least one of the three causal relations (1) "X causes Y", (2)
"Y causes X", or (3) there is a common cause Z influencing both X and Y. As an extension of
this principle, we have argued [5] that causal inference is not always based on statistical
dependencies. Instead, similarities between single objects also indicate causal links (e.g.,
if two T-shirts produced by different companies have the same sophisticated pattern we
would not believe that the designers came up with the patterns independently). We have
therefore postulated the "causal principle" stating that there is a causal link between two
objects whenever the joint description of them is shorter than the concatenation of their
separate descriptions.

To formalize this, we first introduce some concepts of algorithmic information theory
[11]. Let s, t be two binary strings that describe the observed objects and let K(s) denote
the algorithmic information (or "Kolmogorov complexity"), i.e., the length of the shortest
program that generates s on a universal Turing machine [7, 15, 2, 1]. Let K(s|t) denote
the length of the shortest program that generates s from the input t. Then we define [4]:

Definition 2 (algorithmic mutual information).

Let s, t be two binary strings. Then the algorithmic mutual information between s and t
reads

$$ I(s : t) := K(t) - K(t \mid s^*) \stackrel{+}{=} K(s) + K(t) - K(s, t) , \qquad (4) $$

where s* denotes the shortest program that computes s and K(s, t) is the length of the
shortest program generating the concatenation of s and t.
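Since Kolmogorov complexity is uncomputable, eq. (4) cannot be evaluated directly. A common heuristic, which is not part of the formal framework used here, is to replace K by the output length of a real compressor. The sketch below uses `zlib` as such a stand-in; the helper names are illustrative, and the numbers it produces are only a crude proxy for I(s : t).

```python
import random
import zlib


def c_bits(data: bytes) -> int:
    """Compressed length in bits, used as a computable stand-in for K."""
    return 8 * len(zlib.compress(data, 9))


def mutual_information_proxy(s: bytes, t: bytes) -> int:
    """Proxy for I(s : t) = K(s) + K(t) - K(s, t), with zlib in place of K."""
    return c_bits(s) + c_bits(t) - c_bits(s + t)


rng = random.Random(0)
s = bytes(rng.randrange(256) for _ in range(2000))
t = bytes(rng.randrange(256) for _ in range(2000))

dependent = mutual_information_proxy(s, s)    # second object is a copy of s
independent = mutual_information_proxy(s, t)  # second object drawn independently
```

For the copied string the joint description is hardly longer than a single description, so `dependent` is large, whereas `independent` stays near zero. In the language of the causal principle, only the first pair would suggest a causal link.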

As usual in algorithmic information theory, all (in)equalities are only understood up to
a constant that depends on the Turing machine [11]. For this reason, we write $\stackrel{+}{=}$ instead
of $=$. Since s can be computed from s*, but usually not vice versa, we have

$$ K(t \mid s^*) \stackrel{+}{\leq} K(t \mid s) . \qquad (5) $$

We will later also need the conditional version of eq. (4), see [4]:

Definition 3 (conditional algorithmic mutual information).

Let s, t, v be binary strings. Then the conditional algorithmic mutual information reads

$$ I(s : t \mid v) := K(t \mid v) - K(t \mid s, K(s \mid v), v) \stackrel{+}{=} K(s \mid v) + K(t \mid v) - K(s, t \mid v) . \qquad (6) $$

Eq. (4) is formally similar to the statistical mutual information

$$ I(X : Y) := H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) , $$

phrased in terms of the Shannon entropy $H(\cdot)$. Reichenbach's principle can then be
rephrased as:

"I(X : Y) > 0 indicates that there is at least one of the three possible causal
links between X and Y."

In analogy to this principle, we have postulated in [6]:

Postulate 2 (Causal Principle).

Let s and t be binary strings that formalize the descriptions of two objects in nature.
Whenever

$$ I(s : t) > 0 , $$

there is a causal link between the two objects s and t in the sense that s → t or t → s or
there is a third object u with s ← u → t.

Here, it is up to the researcher's decision how to set the threshold above which a
dependence is considered significant. This is similar to setting the significance level in a
statistical test.

Note that the condition $K(t) - K(t \mid s) \gg 0$ implies $I(s : t) \gg 0$ due to ineq. (5). We
will work with the former condition since it is easier to test.

To interpret Postulate 1 as a special case of Postulate 2 we consider the following
model [6] of a causal structure X → Y for two random variables X and Y. Take as the
two objects in nature a source S that generates x-values according to p(x) and a machine
M that takes x-values as input and generates y-values according to p(y|x) (see Figure 1).

Figure 1: Causal structure obtained by resolving the causal structure X → Y between the random
variables X and Y into causal relations among single events.

If S and M have been designed independently, their optimal joint description should be
given by separate descriptions of S and M. However, the only feature of S that is relevant
for our observations is given by the distribution of x-values, i.e., $p_X$. Similarly, $p_{Y|X}$ is the
only relevant feature of M. These features are directly given by observing the x- and the
y-values after infinite sampling. We therefore consider the algorithmic dependencies between
$p_X$ and $p_{Y|X}$. Since the objects of our descriptions will be probability distributions, we
introduce the following concept:

Definition 4 (computable functions and distributions). 

Let S denote some subset of $\mathbb{R}^k$. A function $f : S \to \mathbb{R}$ is computable if there is a program
that computes f(x) up to a precision $\epsilon > 0$ for every input $(x, \epsilon)$ for which x has a
finite description. Then K(f) denotes the length of the shortest program of this kind. A
probability distribution on a finite probability space S is called computable if its density is
a computable function.
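To illustrate what a program "of this kind" looks like, the following sketch computes the exponential function in the sense of Definition 4: it takes a rational input x together with a rational precision ε and returns a rational number within ε of exp(x). The function name and the use of a Taylor series are illustrative choices, not something prescribed by the definition.

```python
import math
from fractions import Fraction


def exp_to_precision(x: Fraction, eps: Fraction) -> Fraction:
    """Return a rational number within eps of exp(x), for rational x and eps > 0."""
    total, term, k = Fraction(0), Fraction(1), 0   # term = x**k / k!
    while True:
        total += term
        next_term = term * x / (k + 1)
        # Once k + 2 >= 2|x|, every later term is at most half its predecessor,
        # so the neglected tail is bounded by 2 * |next_term|.
        if k + 2 >= 2 * abs(x) and 2 * abs(next_term) < eps:
            return total
        term, k = next_term, k + 1


approx_e = exp_to_precision(Fraction(1), Fraction(1, 10**12))
approx_small = exp_to_precision(Fraction(-3, 2), Fraction(1, 10**10))
```

The length of the shortest program with this input-output behaviour is what K(f) refers to; the sketch itself only provides an upper bound on that length.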

In the following section we apply the concepts introduced above to the case of strictly
positive continuous densities p(x, y).

3 Algorithmic dependencies induced by additive noise models

We have already argued that an additive noise model from X to Y makes the causal
structure Y → X unlikely because $p_Y$ and $p_{X|Y}$ then satisfy the non-generic relation of
eq. (2). We now express this fact in terms of algorithmic information theory:

Theorem 1 (algorithmic dependence induced by an additive noise model).
Let p(x, y) be a twice differentiable, computable, strictly positive probability density
over $\mathbb{R}^2$. If p(x, y) admits an additive noise model from X to Y with a computable
differentiable function f, then

$$ I(p_Y : p_{X|Y}) \stackrel{+}{\geq} K(p_Y) - K(y_0, \psi'(y_0)) - K(x_0, f'(x_0)) , $$

where $x_0$ and $y_0$ are arbitrary computable x- and y-values, respectively, and $\psi(y) := \log p_Y(y)$.

Proof: Eq. (2) expresses the second derivative $(\log p_Y)''$ in terms of $p_{X|Y}$ and $f'(x_0)$.
Hence,

$$ K((\log p_Y)'' \mid p_{X|Y}) \stackrel{+}{\leq} K(x_0, f'(x_0)) . \qquad (7) $$

We have by definition

$$ I(p_Y : p_{X|Y}) = K(p_Y) - K(p_Y \mid p^*_{X|Y}) \stackrel{+}{\geq} K(p_Y) - K(p_Y \mid p_{X|Y}) . \qquad (8) $$

The density $p_Y$ is already determined by $(\log p_Y)''$ and the first derivative $\psi'(y_0)$ for some
$y_0$ because $\log p_Y(y_0)$ then follows from normalization. Therefore,

$$ K(p_Y \mid z) \stackrel{+}{\leq} K((\log p_Y)'' \mid z) + K(y_0, \psi'(y_0) \mid z) , $$

where z is some arbitrary prior information. Using $z = p_{X|Y}$, the right-hand term of
ineq. (8) yields

$$ \begin{aligned}
I(p_Y : p_{X|Y}) &\stackrel{+}{\geq} K(p_Y) - K((\log p_Y)'' \mid p_{X|Y}) - K(y_0, \psi'(y_0) \mid p_{X|Y}) \\
&\stackrel{+}{\geq} K(p_Y) - K(x_0, f'(x_0) \mid p_{X|Y}) - K(y_0, \psi'(y_0) \mid p_{X|Y}) \\
&\stackrel{+}{\geq} K(p_Y) - K(x_0, f'(x_0)) - K(y_0, \psi'(y_0)) ,
\end{aligned} $$

where the second inequality is due to ineq. (7). □

The interpretation of Theorem 1 raises two problems: First, we cannot determine the
exact "true" probabilities³ from the observations, and second, we do not expect these
probabilities to be computable, and hence it would require an infinite amount of information
to describe $p_Y$ and $p_{X|Y}$ even if we could. As already pointed out in [6], algorithmic
dependencies among the empirical distributions $q_Y$ and $q_{X|Y}$ after finite sampling do not show
algorithmic dependencies between S and M. For continuous variables, this is already
obvious from the fact that the conditional distribution of X, given Y, is only defined on
the support of $q_Y$. If the true distribution has a density, the empirical distribution contains
every y-value only once and knowing the support of $q_Y$ thus already implies knowing $q_Y$.

To circumvent this problem, we will in the following section consider additive noise
models over a finite probability space. Within this setting, we derive statements on
distributions that are close to additive noise models. Since the finite case has the advantage
that empirical frequencies converge pointwise to the true probabilities, this result also
implies statements for the corresponding empirical distribution.

4 Stronger statements in finite probability spaces 

The following theorem is a modification of Theorem 1 for additive noise models over the
finite field $\mathbb{Z}_m$ for some prime number m.

³ It is, anyway, a philosophical problem to what extent they are well-defined.
Theorem 2 (Algorithmic information between $p_Y$ and $p_{X|Y}$ for the discrete model).

Let $p_{X,Y}$ be a computable strictly positive distribution on $\mathbb{Z}_m \times \mathbb{Z}_m$ for some prime number
m that admits an additive noise model, i.e., there is a function $f : \mathbb{Z}_m \to \mathbb{Z}_m$ such that
$E := Y - f(X)$ and X are statistically independent. Here, subtraction is understood with
respect to $\mathbb{Z}_m$. Then, if f is non-constant, we have

$$ I(p_Y : p_{X|Y}) \stackrel{+}{\geq} K(p_Y) - 2 \log m . \qquad (9) $$

Proof: The idea is, again, to derive an equation that shows that $p_Y$ is essentially
determined by $p_{X|Y}$ up to some small amount of additional information. We have

$$ \log p(x, y) = \log p_X(x) + \log p_E(y - f(x)) . $$

Defining $\delta := f(x_0 + 1) - f(x_0)$ for some $x_0$ for which $\delta \neq 0$, we introduce

$$ k(x \mid y) := \log p(x-1 \mid y) - \log p(x-1 \mid y-\delta) + \log p(x \mid y) - \log p(x \mid y+\delta) , \qquad (10) $$

which yields the equation

$$ \log p(y+\delta) - \log p(y) = k(x_0+1 \mid y) + \log p(y) - \log p(y-\delta) . \qquad (11) $$

We interpret eq. (11) as a discrete version of eq. (2) because it relates differences between
the values log p(y) at different points y to the quantity $k(x_0+1 \mid y)$, which is a property of the
conditional $p_{X|Y}$ alone. Eq. (11) implies for arbitrary $y_0$

$$ \log p(y_0 + (j+1)\delta) - \log p(y_0 + j\delta) = \log p(y_0 + j\delta) - \log p(y_0 + (j-1)\delta) + k(x_0+1 \mid y_0 + j\delta) , $$

for all j = 1, ..., m. Writing $\log p_Y$ for the vector with coefficients $\log p(y_0 + (j+1)\delta)$ and
k for the vector with coefficients $k(x_0+1 \mid y_0 + j\delta)$ for j = 0, ..., m − 1, we rewrite eq. (11) as

$$ (S - I)^2 \log p_Y = k , $$

where S denotes the cyclic shift in dimension m. Using the fact that (S − I) is invertible
on the space of vectors with zero sum of coefficients, we thus obtain

$$ \log p_Y = (S - I)^{-2} k + \alpha e , \qquad (12) $$

where $\alpha$ is given by normalization and e is the vector with only ones as entries. This
shows that $x_0$, $\delta$, and $p_{X|Y}$ determine $p_Y$. Denoting $i := (x_0, \delta)$ we can summarize the
above into $K(p_Y \mid p_{X|Y}, i) \stackrel{+}{=} 0$. This implies

$$ K(p_Y \mid p_{X|Y}) \stackrel{+}{\leq} K(i) , $$

because

$$ \begin{aligned}
K(p_Y \mid p_{X|Y}) - K(p_Y \mid p_{X|Y}, i) &\stackrel{+}{=} K(p_Y \mid p_{X|Y}) - K(p_Y \mid p_{X|Y}, K(i \mid p_{X|Y}), i) \\
&\stackrel{+}{=} I(p_Y : i \mid p_{X|Y}) \stackrel{+}{\leq} K(i) ,
\end{aligned} $$

where the second equality is due to the definition of conditional algorithmic mutual
information (6). Since $x_0$ and $\delta$ are elements of $\mathbb{Z}_m$, we have $K(i) \stackrel{+}{\leq} 2 \log m$, and ineq. (9)
follows from $I(p_Y : p_{X|Y}) \stackrel{+}{\geq} K(p_Y) - K(p_Y \mid p_{X|Y})$. □
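The central step of the proof, namely that $x_0$, δ and the conditional alone determine $p_Y$, can be replayed numerically. The sketch below uses hypothetical values m = 5, f(x) = x² mod m, and two arbitrary strictly positive distributions for X and E. It evaluates $k(x_0+1 \mid y)$ from eq. (10) using only p(x|y), inverts the cyclic second difference as in eq. (12), and recovers $p_Y$ after normalization.

```python
import math

m = 5                                   # prime modulus (illustrative)
f = [(x * x) % m for x in range(m)]     # hypothetical non-constant f on Z_m
p_x = [0.10, 0.20, 0.30, 0.25, 0.15]    # hypothetical strictly positive p_X
p_e = [0.40, 0.10, 0.20, 0.05, 0.25]    # hypothetical strictly positive p_E

joint = [[p_x[x] * p_e[(y - f[x]) % m] for y in range(m)] for x in range(m)]
p_y = [sum(joint[x][y] for x in range(m)) for y in range(m)]
log_cond = [[math.log(joint[x][y] / p_y[y]) for y in range(m)] for x in range(m)]

x0 = 0
delta = (f[(x0 + 1) % m] - f[x0]) % m   # must be non-zero
assert delta != 0


def k(y):
    """k(x0 + 1 | y) from eq. (10); it uses the conditional p(x | y) only."""
    a, b = x0 % m, (x0 + 1) % m
    return (log_cond[a][y % m] - log_cond[a][(y - delta) % m]
            + log_cond[b][y % m] - log_cond[b][(y + delta) % m])


def reconstruct_p_y(y0=0):
    """Recover p_Y from k, x0 and delta by inverting the cyclic second difference."""
    ys = [(y0 + j * delta) % m for j in range(m)]
    partial = [0.0]                     # partial[j] = k(ys[1]) + ... + k(ys[j])
    for j in range(1, m):
        partial.append(partial[-1] + k(ys[j]))
    d0 = -sum(partial) / m              # cyclic first differences sum to zero
    u = [0.0]                           # u[j] = log p(ys[j]) - log p(ys[0])
    for j in range(m - 1):
        u.append(u[-1] + d0 + partial[j])
    z = sum(math.exp(v) for v in u)     # normalization fixes the constant alpha
    reconstructed = [0.0] * m
    for j, y in enumerate(ys):
        reconstructed[y] = math.exp(u[j]) / z
    return reconstructed


p_y_rec = reconstruct_p_y()
```

Up to rounding, `p_y_rec` coincides with `p_y`, although `reconstruct_p_y` never reads the marginal: besides the conditional it needs only the pair (x0, delta), which costs at most 2 log m bits.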

We want to derive a similar lower bound for the case where $p_{X,Y}$ almost admits an additive
noise model. To this end, we first define a precision-dependent Kolmogorov complexity of
a probability distribution:

Definition 5 (Precision dependent algorithmic information).

Let p be a density on a finite probability space. Let r be a computable probability density and
K(r) be the length of the shortest program on a universal Turing machine that computes
r(x) from x. Then

$$ K_\epsilon(p) := \min_{r \,:\, D(p \| r) \le \epsilon} K(r \mid \epsilon) , $$

where $D(\cdot \| \cdot)$ denotes the relative entropy distance. Similarly, we define the conditional
complexity $K_\epsilon(p \mid i)$ given some prior information i.

If q is an arbitrary approximation of a distribution p in the sense that
$|\log p(x) - \log q(x)| \le \epsilon$ holds for all x, then $D(p \| q) \le \epsilon$ and thus the precision-dependent
algorithmic information can be bounded from above by the complexity of the approximation:
$K_\epsilon(p) \le K(q)$. For computable p, we obviously have

$$ \lim_{\epsilon \to 0} K_\epsilon(p) = K(p) , $$

but for uncomputable p, the complexity tends to infinity. The following lemma shows the
empirical content of precision-dependent complexity:

Lemma 2 (precision-dependent complexity of empirical distributions).

Let p be a positive definite distribution on a finite probability space and $q^{(n)}$ be the
empirical distribution after n-fold sampling from p. Then

$$ \lim_{n \to \infty} K_\epsilon(q^{(n)}) = K_\epsilon(p) , $$

with probability one.

Proof: Let r be a distribution for which $K_\epsilon(p) = K(r)$ and $D(p \| r) \le \epsilon$. Due to
$D(q^{(n)} \| r) \to D(p \| r)$ with probability one and because of the continuity of relative
entropy for positive definite distributions we also have $D(q^{(n)} \| r) \le \epsilon$ for all sufficiently large
n. Hence $K_\epsilon(q^{(n)}) \le K_\epsilon(p)$.

To prove that $K_\epsilon(q^{(n)}) \ge K_\epsilon(p)$, let $r^{(n)}$ be a sequence of distributions such that
$K_\epsilon(q^{(n)}) = K(r^{(n)})$ and $D(q^{(n)} \| r^{(n)}) \le \epsilon$. Hence, $D(p \| r^{(n)}) \le \epsilon$ for sufficiently large n,
which completes the proof. □
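The convergence used in the proof can be observed in a small simulation. The sketch below is only an illustration with hypothetical distributions p and r on three outcomes: it computes the relative entropy in bits, draws empirical distributions of increasing sample size with a seeded generator, and records how far $D(q^{(n)} \| r)$ is from $D(p \| r)$. It also evaluates the largest pointwise log-difference, which by the remark before Lemma 2 bounds the relative entropy from above.

```python
import math
import random


def kl(p, q):
    """Relative entropy D(p || q) in bits, for strictly positive q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def empirical(p, n, rng):
    """Empirical distribution after n-fold sampling from p."""
    counts = [0] * len(p)
    for outcome in rng.choices(range(len(p)), weights=p, k=n):
        counts[outcome] += 1
    return [c / n for c in counts]


p = [0.5, 0.3, 0.2]      # hypothetical "true" distribution
r = [0.45, 0.35, 0.2]    # hypothetical simple approximation of p

# Pointwise log-closeness implies closeness in relative entropy.
log_gap = max(abs(math.log2(pi / ri)) for pi, ri in zip(p, r))

rng = random.Random(1)
gaps = [abs(kl(empirical(p, n, rng), r) - kl(p, r)) for n in (100, 10_000, 200_000)]
```

With growing n the entries of `gaps` shrink, so any r that is ε-close to p eventually also serves as an ε-close description of the empirical distribution, provided D(p‖r) is strictly below ε.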

The following lemma will later be used to derive a lower bound on $I(p_Y : p_{X|Y})$ in terms
of $K_\epsilon(p_Y)$:

Lemma 3 (mutual information and approximative descriptions).

Let p be a computable distribution on a finite probability space, z an arbitrary string and
$\epsilon > 0$ computable. Let q be a distribution that is $\epsilon$-close to p, i.e.,

$$ D(p \| q) \le \epsilon . \qquad (13) $$

If q can be derived from z and from p in the sense that

$$ K(q \mid p, i_p) \stackrel{+}{=} K(q \mid z, i_z) \stackrel{+}{=} 0 , \qquad (14) $$

for additional strings $i_p$ and $i_z$, then

$$ I(p : z) \stackrel{+}{\geq} K_\epsilon(p) - K(i_p) - K(i_z) . $$

Proof: Using the definition of conditional mutual information (6) we get

$$ I(q : i_p \mid p) \stackrel{+}{=} K(q \mid p) - K(q \mid i_p, K(i_p \mid p), p) \stackrel{+}{=} K(q \mid p) , $$

because eq. (14) implies $K(q \mid i_p, K(i_p \mid p), p) \stackrel{+}{=} 0$. On the other hand $I(q : i_p \mid p) \stackrel{+}{\leq} K(i_p)$
and therefore

$$ K(q \mid p) \stackrel{+}{\leq} K(i_p) . $$

In the same way, eq. (14) implies $K(q \mid z) \stackrel{+}{\leq} K(i_z)$. A data processing inequality (Corollary
II.8 in [4]) then implies

$$ I(p : z) \stackrel{+}{\geq} K(q) - K(i_p) - K(i_z) . $$

We conclude with $K_\epsilon(p) \le K(q)$ due to ineq. (13). □
We will moreover need the following lemma:

Lemma 4 (bound on the differences of logarithms).

Given a vector $v \in \mathbb{R}^m$, we define a probability distribution by

$$ p_j := \frac{1}{Z_v} e^{v_j} , $$

where $Z_v$ is the partition function. Let $\tilde{p}$ be defined by $\tilde{v}$ in the same way. Then

$$ |\log p_j - \log \tilde{p}_j| \le 2 \| v - \tilde{v} \|_\infty . $$

Proof: Due to

$$ \log p_j - \log \tilde{p}_j = v_j - \tilde{v}_j - \log Z_v + \log Z_{\tilde{v}} $$

we only have to show

$$ |\log Z_{\tilde{v}} - \log Z_v| \le \| v - \tilde{v} \|_\infty . $$

To this end, we define

$$ \log z(\epsilon) := \log Z_{v + \epsilon(\tilde{v} - v)} . $$

Using the mean value theorem we have for an appropriate value $\eta \in (0, 1)$

$$ \log Z_{\tilde{v}} - \log Z_v = \log z(1) - \log z(0) = (\log z)'(\eta) . $$

The last expression is the expected value of $\tilde{v}_j - v_j$ with respect to the probability
distribution corresponding to $v + \eta(\tilde{v} - v)$, which cannot be greater in absolute value than
$\| v - \tilde{v} \|_\infty$. □
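Lemma 4 is easy to probe numerically. The following sketch draws random vectors v and perturbed copies ṽ (the dimension, ranges and seed are arbitrary choices), builds the corresponding distributions $p_j = e^{v_j}/Z_v$, and collects both sides of the inequality.

```python
import math
import random


def gibbs(v):
    """p_j = exp(v_j) / Z_v, computed stably by subtracting max(v)."""
    top = max(v)
    weights = [math.exp(vj - top) for vj in v]
    z = sum(weights)
    return [w / z for w in weights]


def lemma4_sides(v, v_tilde):
    """Return (max_j |log p_j - log p~_j|, 2 * ||v - v~||_inf)."""
    p, p_tilde = gibbs(v), gibbs(v_tilde)
    lhs = max(abs(math.log(a) - math.log(b)) for a, b in zip(p, p_tilde))
    rhs = 2 * max(abs(a - b) for a, b in zip(v, v_tilde))
    return lhs, rhs


rng = random.Random(0)
checks = []
for _ in range(1000):
    v = [rng.uniform(-3, 3) for _ in range(6)]
    v_tilde = [vj + rng.uniform(-0.5, 0.5) for vj in v]
    checks.append(lemma4_sides(v, v_tilde))
```

In every trial the left-hand side stays below the right-hand side, as the lemma guarantees; shifting all entries of v by the same constant leaves the distribution unchanged, which is why a factor larger than one is needed at all.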

We have now introduced the technical requirements to formulate a theorem for
approximate additive noise models:

Theorem 3 (approximate additive noise model).

Let $p_{X,Y}$ be as in Theorem 2, but only admitting an approximate additive noise model
in the sense that the statistical mutual information I(X : E) between X and $E := Y - f(X)$
is bounded from above as required by ineq. (15),

where $\beta$ is a lower bound on p(x, y). Here, subtraction is understood with respect to $\mathbb{Z}_m$.
Then, if f is non-constant, we have

$$ I(p_Y : p_{X|Y}) \stackrel{+}{\geq} K_\epsilon(p_Y) - 2 \log m - m - 2 K(\epsilon) . \qquad (16) $$
Proof: The idea is to define a distribution $\tilde{p}_{X,Y}$ that is close to $p_{X,Y}$ and admits an exact
additive noise model: Define a joint distribution on X and E by the product

$$ \tilde{p}_{X,E} := p_X \otimes p_E . $$

By variable transformation, $\tilde{p}_{X,E}$ defines a distribution $\tilde{p}_{X,Y}$ that admits an additive noise
model from X to Y. Eq. (11) now holds for $\tilde{p}_{X|Y}$ and $\tilde{p}_Y$ with $\tilde{k}(x_0+1 \mid y)$ instead of
$k(x_0+1 \mid y)$, which is defined similarly to eq. (10). Denote the corresponding vector by
$\tilde{k} = (\tilde{k}(x_0+1 \mid y))_y$. In analogy to eq. (12) and the proof of Theorem 2 we now have

$$ \log \tilde{p}_Y = (S - I)^{-2} \tilde{k} + \tilde{\alpha} e , $$

where $\tilde{\alpha}$ is the appropriate normalization constant and e the all-one vector. To show that
$p_{X|Y}$ allows an approximative description of $p_Y$ we have to replace $\tilde{k}$ and $\tilde{p}_Y$ with k and
$p_Y$, respectively. We define

$$ \log r_Y := (S - I)^{-2} k + \alpha e , $$

and using Lemma 4 we obtain

$$ \begin{aligned}
\| \log p_Y - \log r_Y \|_\infty &\le \| \log p_Y - \log \tilde{p}_Y \|_\infty + \| \log \tilde{p}_Y - \log r_Y \|_\infty \\
&\le \| \log p_Y - \log \tilde{p}_Y \|_\infty + 2 \| (S - I)^{-2} (\tilde{k} - k) \|_\infty . \qquad (17)
\end{aligned} $$

The moduli of the eigenvalues of $(S - I)^{-1}$ on this subspace are all smaller than m/4
(for m > 2) since they read

$$ \frac{1}{e^{2\pi i/m} - 1} , \ \frac{1}{e^{2\pi i \cdot 2/m} - 1} , \ \dots , \ \frac{1}{e^{2\pi i (m-1)/m} - 1} . $$

We thus have

$$ \| (S - I)^{-2} (\tilde{k} - k) \|_2 \le \frac{m^2}{16} \| \tilde{k} - k \|_2 \le \frac{m^{5/2}}{16} \| \tilde{k} - k \|_\infty , $$

where the last inequality used $\| \cdot \|_2 \le \sqrt{m}\, \| \cdot \|_\infty$. Together with $\| \cdot \|_\infty \le \| \cdot \|_2$, ineq. (17)
then yields

$$ \| \log p_Y - \log r_Y \|_\infty \le \| \log p_Y - \log \tilde{p}_Y \|_\infty + \frac{m^{5/2}}{8} \| \tilde{k} - k \|_\infty . \qquad (18) $$

Now we derive an upper bound on the two summands of the right-hand side using our assumption
on the limited statistical mutual information between X and E. To this end, we observe
that

$$ D(p_{X,Y} \| \tilde{p}_{X,Y}) = D(p_{X,E} \| \tilde{p}_{X,E}) = I(X : E) , \qquad (19) $$

where the first equality is due to the invariance of relative entropy under variable
transformation and the second uses a well-known reformulation of mutual information [3].
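Moreover, we have

$$ D(p_{X|Y} \| \tilde{p}_{X|Y}) = \sum_y p(y)\, D(p_{X|y} \| \tilde{p}_{X|y}) \le I(X : E) , $$

where $p_{X|y}$ denotes the conditional distribution for one specific value y of Y. Using the
lower bound on p(y), the assumption (15) thus bounds $D(p_{X|y} \| \tilde{p}_{X|y})$ for every single y.
Due to the well-known relation $D(p \| q) \ge (2 \ln 2)^{-1} \| p - q \|_1^2$ between relative entropy and
$\ell_1$-distance for two distributions [3], we obtain

$$ \| p_{X|y} - \tilde{p}_{X|y} \|_1 \le \frac{\epsilon \beta}{4 m^3} . $$

This implies

$$ |\tilde{p}(x \mid y) - p(x \mid y)| \le \frac{\epsilon \beta}{4 m^3} , \qquad |\log \tilde{p}(x \mid y) - \log p(x \mid y)| \le \frac{\epsilon}{4 m^3} \qquad (20) $$

by applying the mean value theorem to the function $a \mapsto \log a$. From the definition of
$k(x \mid y)$ and $\tilde{k}(x \mid y)$ in eq. (10) we conclude

$$ \| \tilde{k} - k \|_\infty \le \frac{\epsilon}{m^3} . \qquad (21) $$

On the other hand, (19) implies that $D(p_Y \| \tilde{p}_Y)$ is bounded by I(X : E) as well, and hence

$$ |\log \tilde{p}(y) - \log p(y)| \le \frac{\epsilon}{4 m^3} . \qquad (22) $$

Using ineqs. (21) and (22), ineq. (18) yields for all y

$$ |\log p(y) - \log r(y)| \le \frac{\epsilon}{4} . \qquad (23) $$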
Let $\log q_p(y)$ be given by discretizing all values log p(y) up to an accuracy of $\epsilon/4$. Then

$$ K(q_p \mid p_Y, \epsilon) \stackrel{+}{=} 0 . $$

On the other hand, let $\log q_r(y)$ be given by discretizing all values log r(y) up to an
accuracy of $\epsilon/4$. Then $K(q_r \mid r, \epsilon) \stackrel{+}{=} 0$ and thus

$$ K(q_r \mid p_{X|Y}, \delta, x_0, \epsilon) \stackrel{+}{=} 0 . $$

Due to (23), both discretizations coincide up to one bit for each value y, say $b_m(y)$. To
illustrate this, consider the binary strings 0.111... and 1.000..., which can be arbitrarily
close despite their truncations being different. We conclude that

$$ K(q_p \mid p_{X|Y}, \delta, x_0, \epsilon, b_m) \stackrel{+}{=} 0 . $$

Let q be the distribution generated by $\log q_p$ through normalization:

$$ \log q(y) := \log q_p(y) - \log \sum_{y'} q_p(y') . $$

Due to the upper bound (23), Lemma 4 gives

$$ D(p \| q) \le 2 \| \log p - \log q_p \|_\infty \le \epsilon . $$

The theorem now follows from Lemma 3 applied to $z = p_{X|Y}$, $i_z = (\delta, x_0, \epsilon, b_m)$,
$p = p_Y$ and $i_p = \epsilon$. □



The complexity of $p_Y$ in the bound (16) will typically exceed the terms with m because
we will need several bits for every bin to describe the corresponding probability (this will
be discussed in Section 5 in more detail). Moreover, $K(\epsilon)$ can be quite low, in particular
if we choose $\epsilon = 2^{-k}$ for some k. Therefore, the mutual information between $p_Y$ and $p_{X|Y}$
is almost as large as the complexity of $p_Y$. This shows that the amount of adjustment
required to mimic an additive noise model in the wrong direction depends essentially on
the complexity of $p_Y$. In the following section we consider the complexity in the case in
which $p_Y$ is typical with respect to some known parametric family of distributions.
5 Kolmogorov complexity of distributions from a parametric family

The problem with applying Theorems 2 and 3 to real data is that the term $K_\epsilon(p_Y)$ cannot
be known due to the uncomputability of Kolmogorov complexity in general. Fortunately,
we can prove statements about the increase of the complexity for decreasing $\epsilon$ for typical
elements of a family of distributions. This is shown by the following lemma:

Lemma 5 (typical distributions in parametric families).

Let $p_\theta$ be a parametric family of distributions over some finite probability space and let $\theta$ run
over a d-dimensional manifold $\Lambda \subset \mathbb{R}^d$. Moreover, let $p_\theta$ be computable in the following
sense: there exists a program that computes $p_\theta(y)$ for any computable input $\theta$. If the Fisher
information matrix has full rank for all $\theta \in \Lambda$, the complexity of a typical distribution $p_\theta$
grows logarithmically with decreasing $\epsilon$, i.e., for sufficiently small $\epsilon$

$$ K_\epsilon(p_\theta) \stackrel{+}{=} -\frac{d}{2} \log \epsilon . $$

Proof: Let Fg denote the Fisher information matrix of the parametric family and 9i, 62, 
. . . , 0Ar(j,) e A be the parameter vectors of all computable distributions pg that can be 
described with complexity K(pg) < k. 
For every 9j we have 3 

D{pg\\pg^) ^{9- 9,fFg^{9 ~ 9,) + 0{\\9 - 9,f) . (24) 

For sufficiently small e, the set of all 9 with D{pg \ \pg. ) < e is thus contained in the ellipsoid 

{9~9jfFg^{9-9j)<2e. 

The volume Vj of such an ellipsoid with respect to the Lebesque measure is given by 

jd/2 

v{d]2 + \y 



v. = {detFg.)-^/^^, -{26)"'^ 



This can be seen by transforming the ellipsoid into a sphere of radius \/2e via the linear 
map (FeJ-i/2. 

Now we check how the minimum number of disjoint ellipsoids must increase with e to 
cover at least a constant fraction of the parameter space A. Otherwise, if the total volume 
tends to zero it gets more and more unlikely that it contains a randomly chosen € A. 
We need to increase N{k) proportional to l/(2e)''/^ and k must increase with — |loge 
due to N{n) < 2*^. Hence we need asymptotically at least —{d/2) log2 e bits. 

To see that this is also sufficient, we consider a cube $[0, \lambda]^d \supset \Lambda$ that we divide into $N$
equally sized cubes of side length $\Delta$ with middle points $\theta_1, \ldots, \theta_N$ such that

$$(\theta - \theta_j)^T F_{\theta_j} (\theta - \theta_j) < \epsilon/2$$

for any point $\theta$ in the same cube. By (24), this ensures for all $\theta, \theta_j \in \Lambda$ and sufficiently
small $\epsilon$ that $D(p_\theta \| p_{\theta_j}) < \epsilon$. If $\mu$ is an upper bound for all eigenvalues of all $F_\theta$ it is
sufficient to guarantee

$$\|\theta - \theta_j\|^2 \leq \frac{\epsilon}{2\mu} .$$



This can be achieved by choosing

Hence it is sufficient to choose the smallest $N$ that satisfies

$$N \geq \left( \ldots \right)^{d/2}$$

and whose $d$th root is an integer. The grid and thus every vector $\theta_j$ can be computed from
$\Delta$ and $j$, and $p_{\theta_j}$ can be computed from $\theta_j$ by assumption. Hence,

□ 
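
The scaling stated in Lemma 5 can be illustrated numerically in the simplest case $d = 1$. The sketch below is not part of the original argument: it assumes the Bernoulli family $p_\theta$ with $\theta \in [\beta, 1-\beta]$ as an example, and it uses a greedy left-to-right sweep (with hypothetical helper names) to count how many centres $\theta_j$ are needed so that every $\theta$ lies within relative entropy $\epsilon$ of some centre. The number of bits needed to index a centre, $\log_2 N$, should then grow like $-\frac{1}{2}\log_2 \epsilon$ up to an additive constant.

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def _bisect(f, lo: float, hi: float, iters: int = 60) -> float:
    """Approximate root of an increasing f on [lo, hi] with f(lo) < 0 < f(hi).

    Returns the lower end of the final bracket, so f(result) < 0.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return lo

def greedy_cover_size(eps: float, beta: float = 0.1) -> int:
    """Number of centres used by a greedy sweep so that every theta in
    [beta, 1 - beta] has D(p_theta || p_centre) <= eps for some centre."""
    left, upper, count = beta, 1 - beta, 0
    while left < upper:
        # Furthest centre to the right that still covers the point `left`.
        if kl_bernoulli(left, upper) <= eps:
            centre = upper
        else:
            centre = _bisect(lambda c: kl_bernoulli(left, c) - eps, left, upper)
        count += 1
        # Furthest point to the right of the centre that it still covers.
        if kl_bernoulli(upper, centre) <= eps:
            break
        left = _bisect(lambda t: kl_bernoulli(t, centre) - eps, centre, upper)
    return count

for eps in (1e-3, 1e-4, 1e-5):
    n_centres = greedy_cover_size(eps)
    # Compare the index length log2(N) with the predicted slope -(1/2) log2(eps).
    print(eps, math.log2(n_centres), -0.5 * math.log2(eps))
```

Dividing $\epsilon$ by four should roughly double the number of centres, which corresponds to one additional bit, in line with the exponent $d/2 = 1/2$.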

We will now apply Lemma 5 to the family of all distributions $p_Y$ for which $p(y)$ is bounded
from below by some $\beta > 0$. It is canonically parameterized by the first $m - 1$ probabilities
if there are $m$ possible $y$-values. Then we obtain:

Corollary 1 (algorithmic mutual information for typical distributions). 

Let $p_{XY}$ be as in Theorem 3. Further assume that $p_Y$ is typical in the family of distributions
on $m$ values whose probabilities are bounded from below by some $\beta > 0$. If $I(X : E)$
satisfies the bound (15) with e — 2^^ for sufficiently large N, then 



I{PY ■■ Px\y) > ~Y~ log^ ~ 21ogm - m . 

Proof: One can check that the Fisher information matrix has full rank. Then the proof
of the preceding lemma shows for sufficiently small $\epsilon$

$$K_\epsilon(p_Y) \geq -\frac{m-1}{2}\log \epsilon .$$

Plugging this into the lower bound of Theorem 3 together with the value of $\epsilon$ chosen above
concludes the proof. □

Hence, for typical $p_Y$, the lower bound is positive if $m$ and $n$ are large enough. This
asymptotic statement still holds true if $p_Y$ looks on a coarse-grained scale like some
simple distribution $q_Y$, e.g., a Gaussian, but shows irregular deviations from $q_Y$ if the
probabilities are described more accurately.

To give an impression of the amount of information between $p_{X|Y}$ and $p_Y$ that can
be inferred after $n$-fold sampling, we recall that the mutual information between $E$ and
$X$ can be estimated up to an accuracy of $O(1/n)$ [12]. The lowest upper bound on $\epsilon$
in ineq. (15) that can be guaranteed by the observations is thus proportional to $1/\sqrt{n}$.
Hence, for constant $m$, the best lower bound on the amount of algorithmic information
shared by $p_Y$ and $p_{X|Y}$ increases logarithmically in $n$ as long as the sample is not sufficient
to reject independence between $Y - f(X)$ and $X$.
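
To make the logarithmic growth explicit, one can insert the attainable accuracy into the complexity term of Lemma 5. The following short computation is only a sketch under two assumptions that we state here: the constant $c > 0$ in $\epsilon = c/\sqrt{n}$ is introduced for illustration, and the family has $d = m - 1$ free parameters as in the canonical parameterization above.

```latex
% Assumptions: \epsilon = c / \sqrt{n} with a constant c > 0, and d = m - 1.
\begin{align*}
-\frac{d}{2}\log \epsilon
  &= -\frac{m-1}{2}\,\log\frac{c}{\sqrt{n}} \\
  &= \frac{m-1}{4}\,\log n \;-\; \frac{m-1}{2}\,\log c .
\end{align*}
```

For fixed $m$ the second term is a constant, so the complexity term grows like $\frac{m-1}{4}\log n$.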



6 Applying the results to empirical distributions

In applying Theorems 2 and 3 to realistic situations, we still have the problem that we
have no reason to believe that the true distribution is computable. On the other hand,
applying the argument to the empirical distribution (which is, for large sample sizes, close to
an additive noise model) is still problematic because algorithmic dependencies between
the empirical distribution $q_Y$ and the empirical conditional $q_{X|Y}$ do not prove algorithmic
dependencies between the true distributions $p_Y$ and $p_{X|Y}$. One reason is that every
conditional probability $q_{Y|X}(y|x)$ can always be written as a fraction with denominator
$q_X(x) n$, which already is an algorithmic dependence.
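
The remark about denominators can be checked with exact rational arithmetic. The following Python sketch uses a small made-up sample and a hypothetical helper `empirical_conditional`; it only illustrates that each empirical conditional probability times $q_X(x)\,n$ is an integer count.

```python
from collections import Counter
from fractions import Fraction

def empirical_conditional(pairs):
    """Return {(x, y): q_{Y|X}(y|x)} as exact fractions count(x, y) / count(x)."""
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    return {(x, y): Fraction(c, x_counts[x]) for (x, y), c in joint.items()}

# Toy sample of (x, y) pairs, chosen only for illustration.
pairs = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 1)]
n = len(pairs)

q_x = {x: Fraction(c, n) for x, c in Counter(x for x, _ in pairs).items()}
q_y_given_x = empirical_conditional(pairs)

for (x, y), q in q_y_given_x.items():
    # q_X(x) * n is the number of occurrences of x, so multiplying the
    # conditional by it recovers the integer count of the pair (x, y).
    assert (q * q_x[x] * n).denominator == 1
```

Knowing the empirical conditional therefore already restricts the possible values of $q_X(x)\,n$, which is the kind of dependence the text refers to.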
We now describe how to use Postulate 1 if only a finite list of $(x, y)$-pairs is observed
and the underlying distribution is not known. Given samples $S_n = [(x_1, y_1), \ldots, (x_n, y_n)]$,
we can generate a non-empty subsample $S_{\ell(n)} = [(x_1, y_1), \ldots, (x_{\ell(n)}, y_{\ell(n)})]$ with high
probability such that every $x$-value occurs exactly $\ell(n)/m$ times. The samples $S_{\ell(n)}$ can
then be used for the estimation of $P_{Y|X}$. Here, $\ell(n)$ is chosen independently of the
samples in such a way that for $n \to \infty$ we have $\ell(n) \to \infty$ and the probability of obtaining
$S_{\ell(n)}$ from $S_n$ converges to one.
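
The construction of the balanced subsample can be sketched as follows. The text does not specify which occurrences of each $x$-value are kept, so the sketch assumes that the first $\ell/m$ occurrences in sample order are taken; the function name `balanced_subsample` and its signature are likewise hypothetical.

```python
from collections import defaultdict

def balanced_subsample(pairs, x_values, ell):
    """Return ell pairs in which every value in `x_values` occurs exactly
    ell / len(x_values) times, or None if the sample does not allow this.

    `ell` has to be fixed in advance, independently of the observed pairs.
    """
    m = len(x_values)
    if ell <= 0 or ell % m != 0:
        raise ValueError("ell must be a positive multiple of the number of x-values")
    per_value = ell // m
    by_x = defaultdict(list)
    for x, y in pairs:
        # Assumption: keep the first `per_value` occurrences of each x-value.
        if len(by_x[x]) < per_value:
            by_x[x].append((x, y))
    if any(len(by_x[x]) < per_value for x in x_values):
        return None  # the event whose probability should vanish for large n
    # The result is grouped by x-value, not kept in the original sample order.
    return [pair for x in x_values for pair in by_x[x]]

sample = [(0, "a"), (1, "b"), (0, "c"), (1, "d"), (0, "e")]
sub = balanced_subsample(sample, x_values=[0, 1], ell=4)
too_large = balanced_subsample(sample, x_values=[0, 1], ell=6)
```

Returning `None` when some $x$-value is too rare corresponds to the case in which no subsample of the required form can be obtained from $S_n$.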

Now by construction, if M contains no information about S, the empirical distribution
$q^{(\ell(n))}_{Y|X}$ of the subsample must not contain any information about the empirical distribution
$q^{(n)}_X$ of $x$-values in the entire sample, i.e.,

$$M_{X \to Y} := I\big(q^{(n)}_X : q^{(\ell(n))}_{Y|X}\big) \approx 0 . \qquad (25)$$



In the spirit of [S], we postulate that the violation of eq. (25) indicates that the causal 
hypothesis $X \to Y$ is wrong or the mechanisms generating $x$-values and the mechanisms
generating $y$-values from $x$-values have not been generated independently. For a discussion
of this case see [10]. Using this terminology, our goal is to derive a lower bound on $M_{Y \to X}$
for the case where $p_{X,Y}$ admits an additive noise model from $X$ to $Y$. We can apply
Theorem 3 to a distribution that is defined by the empirical results via

p'{x,y) := 

which is necessarily computable because it only contains rational values.

We have already argued that the causal hypothesis $Y \to X$ would only be acceptable
if

$$I\big(q^{(n)}_Y : q^{(\ell(n))}_{X|Y}\big) \approx 0 .$$

If the true distribution $p$ almost admits an additive noise model from $X$ to $Y$ in the sense
of ineq. (15), the same inequality will also be satisfied by $p'$ if $n$ is sufficiently high and
thus

provided that $K_\epsilon(q^{(n)}_Y)$, which coincides with $K_\epsilon(p_Y)$ due to Lemma 2 for large $n$, is high.



7 Approximating continuous variables with discrete ones

Causal inference via additive noise models has been described and tested for continuous 
variables [5] . We have discussed the discrete case mainly for technical reasons because we 
were able to prove statements for distributions that are only close to additive noise models. 
Our results can easily be applied to the continuous case by discretization with increasing 
number of bins. As already mentioned, the discretized version of the empirical distribution 
becomes computable, which circumvents the problem that the true distribution may be 
uncomputable.

Before we discuss the discretization in detail, we emphasize that there is a problem 
with applying Postulate 1 to the conditionals obtained after discretizing the variables:
if we define discrete variables $X^{(m)}$ and $Y^{(m)}$ by putting $X$ and $Y$ into $m$ bins each,
the discretized conditional $p_{Y^{(m)}|X^{(m)}}$ does not only depend on $p_{Y|X}$. Instead, it also
contains information about the distribution of $X$. For this reason, algorithmic dependencies
between $p_{Y^{(m)}|X^{(m)}}$ and $p_{X^{(m)}}$ only disprove the causal hypothesis $X \to Y$ if the
binning is fine enough to guarantee that the discrete value $x^{(m)}$ is sufficient to determine
the conditional probability for $y^{(m)}$, i.e., the relevance of the exact value $x$ is negligible if
the discrete value is given. It is therefore essential that the argument below refers to the
asymptotic case of infinitely small binning.

To approximate a continuous density $p(x, y)$ on $\mathbb{R}^2$ by discrete distributions with increasing
$m := 2k+1$ we consider the square




for all odd $m$ and replace $p(x, y)$ with $p(x, y|Q)$. We discretize $Q$ into $m \times m$ bins of
equal size, which defines a probability distribution over $\mathbb{Z}_m$-valued variables $X_m$ and $Y_m$,
respectively. We define the function $f_m : \mathbb{Z}_m \to \mathbb{Z}_m$ by assigning the values $f(\Delta(j - 1/2))$
with $j = -k, \ldots, k$ to the corresponding bins.

Moreover, appropriate smoothness assumptions on $p(x, y)$ can guarantee that the mutual
information between $Y_m - f_m(X_m)$ and $X_m$ converges to $I(X : (Y - f(X)))$ for
$m \to \infty$. It is known [12] that there are estimators for mutual information that converge
if the binning $m$ is increased proportionally to $\sqrt{n}$ for sample size $n \to \infty$. If $p(x, y)$ admits
an additive noise model, i.e., $I(X : (Y - f(X))) = 0$, then $I(X_m : (Y_m - f_m(X_m))) \to 0$.
Hence, the discrete distributions on $X_m$ and $Y_m$ get arbitrarily close to discrete additive
noise models. Applying Theorem 2 to these discrete distributions then yields algorithmic
dependence between the discretized marginal and the discretized conditional.
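
The discretization step can be sketched in a few lines of Python. Several details below are assumptions made for illustration and are not taken from the text: the square is taken to be $[-a, a) \times [-a, a)$ with half width `half_width`, the difference $Y_m - f_m(X_m)$ is taken modulo $m$, function values outside the square are assigned to the nearest border bin, and the mutual information is estimated by the simple plug-in estimator. All function names are hypothetical.

```python
import math
from collections import Counter

def to_bin(value, half_width, m):
    """Index in {0, ..., m-1} of the equal-width bin of [-half_width, half_width)
    containing `value`; values outside go to the nearest border bin."""
    idx = math.floor((value + half_width) / (2 * half_width) * m)
    return min(max(idx, 0), m - 1)

def plugin_mutual_information(pairs):
    """Plug-in estimate of I(A : B) in bits from a list of (a, b) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        c / n * math.log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )

def residual_dependence(samples, f, half_width, m):
    """Estimate I(X_m : Y_m - f_m(X_m)), with the difference taken modulo m."""
    delta = 2 * half_width / m  # bin width
    pairs = []
    for x, y in samples:
        if not (-half_width <= x < half_width and -half_width <= y < half_width):
            continue  # condition on the square Q: drop points outside it
        xb, yb = to_bin(x, half_width, m), to_bin(y, half_width, m)
        x_mid = -half_width + (xb + 0.5) * delta  # midpoint of the x-bin
        fb = to_bin(f(x_mid), half_width, m)      # discretized function value
        pairs.append((xb, (yb - fb) % m))
    return plugin_mutual_information(pairs)

# Toy data on the bin midpoints for m = 5 and half_width = 2.5.
centres = [-2.0, -1.0, 0.0, 1.0, 2.0]
independent = [(x, e) for x in centres for e in (-1.0, 0.0, 1.0)]  # Y = 0 + E
dependent = [(x, x) for x in centres]                              # Y = X

mi_independent = residual_dependence(independent, lambda x: 0.0, 2.5, 5)
mi_dependent = residual_dependence(dependent, lambda x: 0.0, 2.5, 5)
```

In the first toy data set the residual is independent of $X_m$ and the estimate vanishes, whereas fitting the wrong function to the second data set leaves a residual that determines $X_m$ completely.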

8 Conclusions 

We have discussed a causal inference method that prefers the causal hypothesis $X \to Y$
to $Y \to X$ whenever the joint distribution $p_{X,Y}$ admits an additive noise model from $X$
to $Y$ and not vice versa. It seems that this way of reasoning assumes that all causal
mechanisms in nature can be described by additive noise models (which is certainly not the
case). Here we argue that the method is nevertheless justified because it is unlikely that a
causal mechanism that is not of the form of an additive noise model generates a distribution
that looks like an additive noise model in the wrong direction. This is because such a
coincidence would require mutual adjustments between P(cause) and P(effect|cause). To
measure the amount of tuning needed for this situation we have derived a lower bound
on the algorithmic information shared by P(cause) and P(effect|cause). If we assume
that "nature chooses" P(cause) and P(effect|cause) independently, a significant amount
of shared algorithmic information is not acceptable. Our justification of additive-noise-model
based causal discovery is thus an application of two recent proposals for using algorithmic
information theory in causal inference: [9] postulated that the shortest description of
P(cause, effect) is given by separate descriptions of P(cause) and P(effect|cause), which
would be violated then. [5] argued that algorithmic dependencies between any two objects
require a causal explanation. They consider the two mechanisms that determine
P(cause) and P(effect|cause), respectively, as two objects and conclude that the absence
of causal links on the level of the two mechanisms implies their algorithmic independence,
in agreement with [9].